JAIS Initiative: Nile-Chat Models

Model Overview

[Figure: overall benchmark scores]

Nile-Chat is a family of open, instruction-tuned models for the Egyptian dialect of Arabic, developed to handle both scripts commonly used in Egypt: Arabic script and Latin-based Arabizi. As part of the Jais project for standard Arabic and its extensions to dialectal Arabic, Nile-Chat is designed to support natural language generation in a way that reflects the script-diverse nature of Egyptian communication. The models are effective on a variety of tasks, including question answering, translation, and transliteration. Their range of sizes keeps them accessible, from lightweight personal deployments to more powerful setups, enabling broader use of AI technologies for Egyptian Arabic speakers. The family includes two versions:

  • Nile-Chat-4B: A compact 4B parameter model that balances efficiency and fluency, well-suited for generating Egyptian Arabic in both Arabic and Latin scripts.
  • Nile-Chat-12B: A dense 12B parameter model providing high-capacity generation capabilities, ideal for complex, multi-turn interactions and nuanced understanding in both Egyptian scripts.

Paper: Coming Soon!

👥 Our Team

The models were developed by the MBZUAI France Lab, an AI research center in Paris affiliated with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), which is headquartered in Abu Dhabi.

Usage

Below are some code snippets to help you get started with running the model. First, install the Transformers library with:

pip install -U transformers sentencepiece

Then, copy the snippet from the section below.

Running with the pipeline API

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="MBZUAI-Paris/Nile-Chat-12B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda" # replace with "mps" to run on a Mac device
)

Q1:

messages = [
    {"role": "user", "content": 'اسمك ايه؟'},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

A1:

  • Response:

اسمي نايل-شات، على اسم نهر النيل، اطول نهر في العالم، اللي من زمان كان عامل مهم في تطور مصر، وبيساعد في معيشة الناس وأثر على التراث والثقافة بتاعتنا. وعشان انا موديل لغة، الباحثين بتوع جامعة محمد بن زايد للذكاء الاصطناعي دربوني باستخدام مجموعة من المصادر المفتوحة، وعشان كده سموني نايل-شات. لو عندك اي سؤال او محتاج مساعدة في حاجة، اتفضل اسأل!

(English translation: "My name is Nile-Chat, after the Nile river, the longest river in the world, which has long been an important factor in Egypt's development, supporting people's livelihoods and shaping our heritage and culture. Since I am a language model, the researchers at Mohamed bin Zayed University of Artificial Intelligence trained me using a collection of open sources, which is why they named me Nile-Chat. If you have any question or need help with anything, go ahead and ask!")

Q2:

messages = [
    {"role": "user", "content": 'Esmak eh?'},
]
outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

A2:

  • Response:

Esmi Nile-Chat, 3ala esm nahr el Nile, atwal nahr fel 3alam w mahmo kter men el hadarat 3ala tool el taree5. Ana masroor gedan bek! A2dar asa3dak ezay el naharda?

(English translation: "My name is Nile-Chat, after the Nile river, the longest river in the world and one that has mattered greatly to civilizations throughout history. I'm very pleased to meet you! How can I help you today?")
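You can also run the model without the pipeline helper. Below is a minimal sketch using the standard Transformers AutoTokenizer / AutoModelForCausalLM chat-template APIs; the generation settings are illustrative rather than values recommended by the authors.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI-Paris/Nile-Chat-12B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate; places layers on the available device(s)
)

messages = [
    {"role": "user", "content": 'اسمك ايه؟'},  # "What's your name?"
]

# Build the prompt with the model's chat template, then generate a reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))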

Training Data

Nile-Chat models were trained on diverse datasets focused on the Egyptian dialect: approximately 3.3B tokens during the continual pre-training phase, 1.9M instructions during instruction fine-tuning, and 0.2M samples for DPO, all with a maximum sequence length of 2048 tokens. The data includes:

  • Web documents: A diverse collection of Egyptian web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary.
  • Instruction samples created from publicly available Egyptian Arabic datasets, including translation and transliteration data.
  • English and multilingual pre-training and instruction-tuning datasets translated using Claude 3.5 Sonnet (v2).

The dataset covers Egyptian Arabic in both Arabic and Latin scripts. Our instruction-tuning dataset, Egyptian-SFT-Mixture, is publicly available.
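To inspect the instruction-tuning mixture, a minimal sketch with the datasets library is shown below. The repository id MBZUAI-Paris/Egyptian-SFT-Mixture is an assumption based on the organization name, and the split and column names may differ from the published dataset card.

from datasets import load_dataset

# Repository id and split are assumptions; check the dataset card for the exact names.
sft_mixture = load_dataset("MBZUAI-Paris/Egyptian-SFT-Mixture", split="train")
print(sft_mixture)     # number of rows and column names
print(sft_mixture[0])  # a single instruction/response sample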

Implementation Information

Nile-Chat models are based on Gemma 3 models. They were trained on 8 NVIDIA A100 80 GB GPUs in parallel using FSDP on AWS SageMaker. Training uses the Hugging Face Transformers library with parameter-efficient fine-tuning (LoRA, rank 256) for both continual pre-training and instruction fine-tuning, while full fine-tuning is used for DPO. The continual pre-training is divided into two phases: (i) general pre-training on 2.8B tokens of Egyptian web text and (ii) an annealing phase on 0.5B tokens of high-quality Egyptian text.
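For illustration, here is a minimal sketch of a comparable LoRA setup with the peft library. Only the rank (256) comes from the description above; the target modules, alpha, and dropout are assumptions rather than the authors' exact configuration, and the adapter is attached to the released checkpoint purely to keep the example self-contained (the actual training applied LoRA to the Gemma 3 base model).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("MBZUAI-Paris/Nile-Chat-12B")

lora_config = LoraConfig(
    r=256,                   # LoRA rank used for continual pre-training and instruction fine-tuning
    lora_alpha=512,          # assumption: not stated in this card
    lora_dropout=0.05,       # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows the fraction of parameters the adapter trains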

Evaluation

Nile-Chat models were evaluated on a comprehensive suite of tasks using various datasets and benchmarks to assess their performance across multiple dimensions. These included tasks such as:

  • EgyptianMMLU: An Egyptian version of the ArabicMMLU and MMLU benchmarks.
  • EgyptianHellaSwag: An Egyptian version of HellaSwag (in both Arabic and Latin scripts).
  • Belebele Arz_Arab: Belebele is a multiple-choice machine reading comprehension dataset published by Facebook spanning 122 language variants. Evaluation is done on the Arz_Arab portion of Belebele, which corresponds to Egyptian Arabic.
  • Translation: Covers four directions across three languages: Egyptian Arabic (Arabic script), MSA, and English.
  • Transliteration: Transforming a sentence from Egyptian Arabic (written in Arabic script) to Arabizi (written in Latin script) and vice versa.
  • EgyptianPIQA: An Egyptian version of the PIQA benchmark (in both Arabic and Latin scripts).
  • EgyptianWinoGrande: An Egyptian version of the WinoGrande benchmark (in both Arabic and Latin scripts).
  • EgyptianRACE: An Egyptian version of the RACE benchmark (in both Arabic and Latin scripts).
  • EgyptianOpenBookQA: An Egyptian version of the OpenBookQA benchmark.
  • EgyptianAlpacaEval: An Egyptian adaptation of AlpacaEval to assess LLM instruction-following and cultural alignment.

The models were compared against a collection of existing open-source Arabic models to gauge their effectiveness, with a particular focus on performance in Egyptian Arabic. All scores are based on zero-shot performance. The prompts are written mainly in Egyptian Arabic. We used the Language Model Evaluation Harness to conduct these evaluations. All evaluations apply the chat template, except for EgyptianWinoGrande.
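As a rough sketch of this zero-shot setup, the snippet below calls the Language Model Evaluation Harness (lm-eval) Python API. The task names are hypothetical placeholders, since the harness task definitions for the Egyptian benchmarks are not listed in this card, and apply_chat_template requires a recent lm-eval release.

import lm_eval

# Task names are placeholders for the Egyptian benchmark tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-Paris/Nile-Chat-12B,dtype=bfloat16",
    tasks=["egyptian_mmlu", "egyptian_hellaswag"],  # hypothetical task names
    num_fewshot=0,                # zero-shot, as in the reported results
    apply_chat_template=True,     # chat template applied to all tasks except EgyptianWinoGrande
    batch_size=8,
)
print(results["results"])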

Benchmarks:

Arabic Script Benchmarks

| Model | Average | EgyptianMMLU | Belebele Arz | EgyptianHellaSwag | EgyptianPIQA | EgyptianWinoGrande | EgyptianOpenBookQA | EgyptianRACE High | EgyptianRACE Middle | EgyptianAlpacaEval |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma-3-4b-it | 48.76 | 46.08 | 38.56 | 42.56 | 60.32 | 56.49 | 35.79 | 33.68 | 40.06 | 85.30 |
| jais-family-6p7b-chat | 46.64 | 42.60 | 57.33 | 49.18 | 62.23 | 57.04 | 33.33 | 34.72 | 37.50 | 45.86 |
| jais-adapted-7b-chat | 42.18 | 40.96 | 55.67 | 40.85 | 56.50 | 54.35 | 32.89 | 34.62 | 42.33 | 21.45 |
| Qwen2.5-7B-Instruct | 49.40 | 45.74 | 64.22 | 45.47 | 58.02 | 56.41 | 38.70 | 35.45 | 41.76 | 58.80 |
| ALLaM-7B-Instruct-preview | 56.40 | 60.08 | 67.67 | 57.29 | 66.10 | 62.18 | 40.04 | 39.50 | 45.17 | 69.55 |
| c4ai-command-r7b-arabic-02-2025 | 53.36 | 50.97 | 70.67 | 50.39 | 61.84 | 57.20 | 36.91 | 41.89 | 46.02 | 73.36 |
| Llama-3.1-8B-Instruct | 46.31 | 42.88 | 55.89 | 43.10 | 57.97 | 54.27 | 35.57 | 34.41 | 40.34 | 52.35 |
| AceGPT-v2-8b-chat | 58.33 | 55.25 | 73.33 | 53.14 | 62.50 | 58.39 | 39.82 | 41.06 | 47.16 | 93.33 |
| gemma-2-9b-it | 53.17 | 50.72 | 49.44 | 49.53 | 61.35 | 61.79 | 35.79 | 40.23 | 48.01 | 81.66 |
| gemma-3-12b-it | 59.70 | 61.55 | 77.00 | 49.49 | 64.96 | 63.53 | 38.03 | 41.27 | 48.86 | 92.61 |
| jais-family-13b-chat | 49.81 | 44.85 | 66.33 | 52.99 | 64.85 | 57.91 | 36.91 | 33.26 | 38.64 | 52.52 |
| jais-adapted-13b-chat | 49.80 | 50.03 | 65.33 | 47.53 | 61.30 | 56.72 | 37.14 | 35.45 | 41.76 | 52.91 |
| Qwen2.5-14B-Instruct | 57.34 | 60.81 | 72.33 | 55.84 | 63.97 | 59.97 | 38.26 | 43.25 | 50.28 | 71.35 |
| Nile-Chat-4B | 57.85 | 50.25 | 68.56 | 55.92 | 67.30 | 61.87 | 40.94 | 42.10 | 46.02 | 87.65 |
| Nile-Chat-12B | 64.11 | 62.59 | 79.44 | 64.04 | 70.69 | 63.53 | 42.06 | 48.02 | 53.13 | 93.50 |

Latin Script Benchmarks

| Model | Average | EgyptianHellaSwag | EgyptianPIQA | EgyptianWinoGrande | EgyptianRACE High | EgyptianRACE Middle |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 36.93 | 30.90 | 52.76 | 48.57 | 25.47 | 26.94 |
| jais-family-6p7b-chat | 37.58 | 30.27 | 53.25 | 52.14 | 24.18 | 28.06 |
| jais-adapted-7b-chat | 37.06 | 30.81 | 51.67 | 50.40 | 24.38 | 28.06 |
| Qwen2.5-7B-Instruct | 36.87 | 30.51 | 51.88 | 50.95 | 24.88 | 26.11 |
| ALLaM-7B-Instruct-preview | 38.58 | 32.17 | 53.09 | 50.63 | 25.07 | 31.94 |
| c4ai-command-r7b-arabic-02-2025 | 37.38 | 30.88 | 52.32 | 51.43 | 25.07 | 27.22 |
| Llama-3.1-8B-Instruct | 37.62 | 31.77 | 53.30 | 50.24 | 24.48 | 28.33 |
| AceGPT-v2-8b-chat | 38.77 | 33.16 | 53.80 | 50.24 | 26.07 | 30.56 |
| gemma-2-9b-it | 38.70 | 33.75 | 53.69 | 50.79 | 26.66 | 28.61 |
| gemma-3-12b-it | 41.63 | 37.52 | 53.14 | 51.19 | 31.02 | 35.28 |
| jais-family-13b-chat | 36.96 | 30.46 | 53.09 | 48.18 | 25.28 | 27.78 |
| jais-adapted-13b-chat | 36.98 | 31.14 | 52.87 | 50.79 | 23.98 | 26.11 |
| Qwen2.5-14B-Instruct | 39.48 | 33.49 | 52.87 | 53.41 | 27.35 | 30.28 |
| Nile-Chat-4B | 51.38 | 50.55 | 65.32 | 60.62 | 37.36 | 43.06 |
| Nile-Chat-12B | 53.88 | 53.71 | 65.10 | 59.98 | 41.72 | 48.89 |

Translation and Transliteration Tasks:

| Model | Long Trans. BLEU | Long Trans. chrF | Long Trans. BERTScore | Short Trans. BLEU | Short Trans. chrF | Short Trans. BERTScore | Translit. BLEU | Translit. chrF | Translit. BERTScore |
|---|---|---|---|---|---|---|---|---|---|
| gemma-3-4b-it | 20.67 | 44.75 | 73.03 | 04.76 | 31.15 | 52.98 | 01.44 | 20.36 | 47.54 |
| jais-family-6p7b-chat | 12.71 | 36.53 | 68.07 | 08.73 | 31.52 | 56.78 | 00.70 | 10.64 | 42.51 |
| jais-adapted-7b-chat | 10.61 | 27.56 | 63.48 | 09.19 | 24.85 | 53.52 | 01.11 | 06.14 | 40.45 |
| Qwen2.5-7B-Instruct | 19.89 | 44.80 | 73.64 | 11.34 | 36.31 | 54.96 | 02.74 | 20.63 | 49.32 |
| ALLaM-7B-Instruct-preview | 26.57 | 52.59 | 78.34 | 25.20 | 48.12 | 65.97 | 02.10 | 18.92 | 49.42 |
| c4ai-command-r7b-arabic-02-2025 | 25.18 | 50.26 | 77.97 | 23.30 | 45.34 | 65.20 | 03.52 | 24.57 | 50.49 |
| Llama-3.1-8B-Instruct | 12.90 | 32.58 | 68.76 | 09.06 | 28.56 | 54.19 | 03.26 | 17.55 | 48.71 |
| AceGPT-v2-8b-chat | 24.59 | 49.39 | 77.57 | 22.47 | 44.97 | 66.30 | 04.80 | 23.52 | 49.33 |
| gemma-2-9b-it | 23.09 | 46.98 | 75.42 | 11.73 | 39.00 | 60.42 | 02.68 | 24.28 | 48.26 |
| gemma-3-12b-it | 22.90 | 45.97 | 73.46 | 05.24 | 32.82 | 54.34 | 02.77 | 26.16 | 50.47 |
| jais-family-13b-chat | 10.41 | 31.98 | 64.15 | 08.64 | 30.10 | 57.00 | 00.84 | 11.35 | 44.71 |
| jais-adapted-13b-chat | 15.53 | 41.48 | 70.86 | 15.96 | 38.81 | 63.52 | 01.00 | 13.33 | 46.08 |
| Qwen2.5-14B-Instruct | 21.71 | 45.55 | 73.36 | 09.26 | 34.21 | 53.89 | 04.07 | 25.83 | 51.41 |
| Nile-Chat-4B | 37.49 | 58.40 | 84.30 | 30.35 | 52.01 | 74.07 | 51.46 | 80.44 | 89.59 |
| Nile-Chat-12B | 40.53 | 60.61 | 85.45 | 32.20 | 53.53 | 74.72 | 52.21 | 80.97 | 89.71 |
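The translation and transliteration results above are reported with BLEU, chrF, and BERTScore. The snippet below is a minimal sketch (not the authors' scoring code) of how these metrics can be computed with the sacrebleu and bert-score packages; the example sentences and the BERTScore language setting are illustrative assumptions.

import sacrebleu
from bert_score import score as bert_score

# Placeholder system outputs and references; in practice these come from the test set.
hypotheses = ["placeholder system output"]
references = ["placeholder reference translation"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
_, _, f1 = bert_score(hypotheses, references, lang="ar")  # language code is an assumption

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  BERTScore F1: {100 * f1.mean().item():.2f}")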

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Content Creation and Communication

    • Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  • Research and Education

    • Natural Language Processing (NLP) Research: These models can serve as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data

    • The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity

    • LLMs perform better on tasks framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance

    • Natural language is inherently complex. LLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy

    • LLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  • Common Sense

    • LLMs rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
  • Ethical Considerations and Risks

    The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

    • Bias and Fairness
      • LLMs trained on large-scale, real-world text data can reflect socio-cultural biases embedded in the training material.
    • Misinformation and Misuse
      • LLMs can be misused to generate text that is false, misleading, or harmful.
      • Guidelines for responsible use are provided with the model; see the Responsible Generative AI Toolkit.
    • Transparency and Accountability
      • This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
      • A responsibly developed open model offers the opportunity to share innovation by making LLM technology accessible to developers and researchers across the AI ecosystem.

    Risks identified and mitigations:

    • Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
    • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
    • Privacy violations: Models were trained on data filtered for removal of PII (Personally Identifiable Information). Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.